knitr::opts_chunk$set(warning=FALSE, message=FALSE)

1 Welcome

Data gathered from https://data.humdata.org/dataset/novel-coronavirus-2019-ncov-cases

2 Cleaning the data

Here I am doing a few things:

options(stringsAsFactors=FALSE)
library(plyr)
library(dplyr)
library(reshape2)
library(ggplot2)
library(plotly)
library(highcharter)

#function for grabbing status from my filenames
getStatus<- function(x){
  strsplit(x, "-")[[1]] %>%
    last() %>%
    gsub(pattern="\\.csv", replacement="")
}

#function created for adding an active status
createActive <- function(x){
  dcast(Country.Region + Date ~ Status,
        data=x, value.var="Value") %>%
    mutate(Active = Confirmed - (Deaths + Recovered)) %>% 
    melt(id = c("Country.Region", "Date"))
}

#function used to convert a dataset from long to wide
convert <- function(x){
  dcast(Country + Date ~ Status,
        data=x, value.var="Value")
}

###########################################
### This is where I start cleaning the data
files <- list.files("raw", full.names = TRUE)

raw <- files %>% #read in files
  lapply(function(x){
  read.csv(x) %>% 
    mutate(
      Date = as.Date(Date, "%m/%d/%Y"),
      Status = getStatus(x) #add status
    )
}) %>% 
  bind_rows() %>% #combine files
  subset(!(Value==0) )

raw <- raw %>% #combine countries into one
  group_by(Country.Region, Date, Status) %>% 
  summarise(
    Value = sum(Value)
  )

raw <- raw %>% #get global stats
  group_by(Date, Status) %>% 
  summarise(
    Value = sum(Value)
  ) %>% 
  ungroup() %>% 
  mutate(
    Country.Region = "Global"
  ) %>% 
  bind_rows(raw)

raw <- raw %>% #create active columns, delete nulls, rename
  createActive() %>% 
  subset(!is.na(value)) %>% 
  rename(
   Country = Country.Region,
    Value = value,
    Status = variable
  )

total <- raw %>% #Get total values for separate df
  group_by(Country, Status) %>% 
  summarise(
    Value = max(Value)
  ) %>% 
  ungroup() %>% 
  mutate(
    Status = as.character(Status)
  )

#get change per day
raw <- raw %>%
  group_by(Country, Status) %>% 
  mutate(
    Change = Value - lag(Value),
    Rate_Change = (Value - lag(Value)) / lag(Value)
    ) %>% 
  ungroup() %>% 
  mutate(
    Status = as.character(Status)
  )

Now that the data is clean, let's start digging in!

3 A Quick Analysis

Let's start by looking at the overall picture for our top 5 countries.

3.1 Cumulative Total Cases

The above graph shows cumulative cases by type and by country. We can see some pretty interesting things here. For example, Italy has the highest deaths and active cases, yet it is only second in the number of confirmed cases. This could mean a few things:

  • Italy may not have been testing enough people; broader testing would start to bring its apparent Mortality Rate down.

  • Italy's healthcare system is getting overrun, and its deaths are about to skyrocket.

This graph perfectly shows just how many Deaths Italy has compared to our other four countries. At the time of writing, Italy has just about doubled China's death count!

For the remainder of this analysis, I will be focusing on Italy.

3.2 Rates of Change

There are a couple of things to keep in mind when looking at the rate of change:

  • It has more variance in the beginning.

  • As the number of cases gets larger, the rate will tend to even out, or even decline.

This is the basic Exponential growth function:

\(f(t) = P_0(1+r)^t\)

The rate is what gets exponentiated in the function, i.e., if our rate is higher, our confirmed cases will grow much faster with each day. If the rate is lower, that means our cases per day have dropped from the previous day's number. If a trend of consecutively lower rates of change starts to appear, it means the disease is starting to slow down, which could be indicative of being past the peak.
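As a quick illustration, here is that growth function in R; the starting population, rates, and day count below are made-up numbers chosen only to show the effect of the rate, not fitted values:

```r
# Exponential growth: f(t) = P0 * (1 + r)^t
exp_growth <- function(P0, r, t) P0 * (1 + r)^t

exp_growth(100, 0.20, 10) # ~619 cases after 10 days at a 20% daily rate
exp_growth(100, 0.10, 10) # ~259 cases at a 10% daily rate
```

Halving the rate cuts the 10-day total by well over half, which is exactly why a falling rate of change matters so much.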

This graph clearly shows that the Deaths rate of change not only has a higher average, but most of its rates are higher than the Confirmed rate. This tells us the number of Deaths is going up much faster than our Confirmed cases.

There are a multitude of reasons why this could happen:

  • The Confirmed cases are lagging behind, due to the incubation period.

  • The Confirmed cases are inaccurate, as finding the true value is difficult.

  • Our Mortality Rate is not static.

    • This is a likely scenario for a few reasons:
      • The healthcare systems can be overrun, causing more deaths.
      • Population ages vary depending on location, and we know Covid-19 has a higher Mortality Rate among older populations.
      • As well as many more unknown factors.
  • The rates of change will tend to even out over time, which will really be shown in the changes of the rates themselves.

    • Meaning in the beginning stages, the rate of change will start with a lot of extreme jumps.
      • For example, if we have 1 death today and another tomorrow, for a total of two Deaths, the rate of change from day 1 to day 2 will be 1.0.
      • If the next day has no Deaths reported, the change per day will be 0, so the rate of change drops from 1.0 to 0, and the change in that rate is -1.0.
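Using the rate definition applied later in this post (deaths per day divided by the previous day's cumulative total), the toy example above can be checked numerically; the three-day series is hypothetical:

```r
# Hypothetical cumulative Deaths: 1 on day 1, 2 on day 2, still 2 on day 3.
deaths <- c(1, 2, 2)
rate   <- diff(deaths) / head(deaths, -1) # (Value - lag(Value)) / lag(Value)
rate       # 1, 0: the rate jumps to 1.0, then falls to 0
diff(rate) # -1: a single quiet day produces an extreme change in the rate
```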

This will be important for how we build the Monte Carlo Algorithm.

3.3 Deeper look at Italy's Death Rate of Change

Looking at this scatter plot, as the number of Deceased cases rises, the Rate of Change trends downwards. This means that Italy has a negative correlation between the actual change in cases per day and the rate of change. This is encouraging, as it could mean a few things:

  • The infection rate is starting to slow down.
  • The rate of change is starting to average out to a lower number.

Let's find out how well correlated the two actually are, using Pearson's r method.

Before we start, I will remove the outliers from the data, as some of the early rates are inaccurate, for the reasons described before.

Now that that is done, let's check whether our Rate of Change distribution is relatively normal.

3.4 Shapiro-Wilk Test

Let's use a Shapiro-Wilk test to check its normality. But first, I will explain what the test is:

The null hypothesis of this test is that the population is normally distributed.

  • If the p-value is less than the chosen alpha level, the null hypothesis is rejected and there is evidence that the data tested are not normally distributed.
  • Likewise, if the p-value is greater than the alpha level, we fail to reject the null hypothesis, meaning the data are consistent with a normal distribution.
  • It is regarded as one of the best tests for checking this null hypothesis;
    • fun fact: this was established by Monte Carlo simulations.
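For reference, the test is a single call in base R (`shapiro.test` in the stats package); here is a minimal sketch on simulated data, before it is run on the real rates below:

```r
# Sketch: Shapiro-Wilk on 30 draws from a normal distribution.
set.seed(42)
x <- rnorm(30)
res <- shapiro.test(x)
res$statistic # W, close to 1 for normal-looking data
res$p.value   # typically well above 0.05 for normal data: fail to reject
```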
## 
##  Shapiro-Wilk normality test
## 
## data:  std_corr$Rate_Change
## W = 0.94503, p-value = 0.1483

Awesome! This tells us the p-value is 0.1483, which is greater than our alpha of 0.05 (a 95% confidence level)!

This is great news, as it means we fail to reject the null hypothesis! Telling us two things:

  • The Rate of Change distribution is not significantly different from a normal distribution.
  • The W value is 0.945, which is less than 1, but only by 0.05497.
    • This means that the slope of its QQ plot is relatively close to 1, telling us this data is really close to a normal distribution.

In summary, this means that the Rate of Change, in the Deaths, can be used as a predictor for the Monte Carlo Simulator!

3.5 Pearson's r and r²

Let's take a look into Pearson's r.

As expected, there is a negative correlation between the Rate of Change and total cases, as well as a negative correlation between our Rate of Change and the cases per day.

This makes sense: as the number of cases increases, our rate should show less variation. This is also great news, as it means our rate is trending downwards, ultimately suggesting that the rate of Deaths per day is slowing down.

The r-squared value can tell us much more. For example, we know that 88.3% of the variation in our cases per day can be explained by our total cases! It also tells us that 29.4% of the variation in our cumulative cases can be explained by the Rate of Change.

Let's check if these values are statistically significant. To do that, we need the formula below to get our test statistic.

\({ts} = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}\)
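That formula is a one-liner in R. The sample size is not stated in the text; n = 29 below is inferred from the reported values, so treat it as an assumption:

```r
# Test statistic for Pearson's r: ts = r * sqrt(n - 2) / sqrt(1 - r^2)
t_stat <- function(r, n) r * sqrt(n - 2) / sqrt(1 - r^2)

t_stat(0.9394569, 29)  # ~14.25, the Value ~ Change entry
t_stat(-0.5424324, 29) # ~-3.36, reported as its absolute value
```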

##                 Value    Change Rate_Change D_RC_C
## Value             Inf 14.245784    3.355033     NA
## Change      14.245784       Inf    2.206968     NA
## Rate_Change  3.355033  2.206968         Inf     NA
## D_RC_C             NA        NA          NA    Inf

These are our test statistics for each r value. A correlation is statistically significant when its test statistic exceeds the critical t value in absolute terms:

\(|ts| > t_{\alpha/2,\,n-2}\)

Here is a matrix of our r values.

##                  Value     Change Rate_Change D_RC_C
## Value        1.0000000  0.9394569  -0.5424324     NA
## Change       0.9394569  1.0000000  -0.3909310     NA
## Rate_Change -0.5424324 -0.3909310   1.0000000     NA
## D_RC_C              NA         NA          NA      1

This tells us that all our correlations are statistically significant! Which means that we can use each factor as a predictor of the others.

I will be using the correlation in Italy's Deaths rate of change to predict our Deaths per day values, as this is a statistically significant correlation, and the math lines up.

4 Building a model of Italy's Deaths per day

4.1 Facts before the build

There are two things I want to point out about Covid-19:

  • We know the median incubation period is 5 days.
    • This means there is most likely a 5-day lag in our confirmed cases, as half of infected individuals will not show symptoms until after the median incubation period.
  • We know the average time from illness onset to death is 18 days.
    • This can help us estimate the number of infections 23 days prior.

Using these two facts, we can estimate the average time from point of infection to death to be 23 days (5 incubation days + 18 days until death). This means that if someone dies of Covid-19 today, we can estimate that they were infected 23 days prior.
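In code, that estimate is just a 23-day shift; the date below is an arbitrary example, not a real case:

```r
# Estimated infection date = death date - (5 incubation days + 18 days to death)
infection_date <- function(death_date, lag_days = 5 + 18){
  as.Date(death_date) - lag_days
}

infection_date("2020-03-23") # estimated infection on 2020-02-29
```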

sources:

https://www.uptodate.com/contents/coronavirus-disease-2019-covid-19?source=history_widget

https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(20)30566-3/fulltext

I first want to build a model of Italy's predicted Deaths per day; this can further help us find the actual number of Confirmed cases!

4.2 Building the predicted Deaths per day

There are a few things I want to note:

  • The reported Deaths are likely to be the most accurate figures.
    • This is because there are likely many more actual cases of Covid-19 than are reported:
      • Covid-19 is known to be transmittable by people without showing symptoms.
      • Not everyone will be tested.
  • Using the Deaths rate of change, we can estimate the future Deaths to be reported.

So let's get started!

4.2.1 The math behind using Deaths

\({Deaths}={Infected}*{MortalityRate}\)

This is our formula for the number of Deaths in any given population of infected individuals. We are focusing on the Deaths per day, as I want a per-day model.

This means we need to take the derivative of our formula with respect to time. This leads to two options:

  • The Mortality Rate is a constant value for the population.

  • The Mortality Rate is a value that also changes with time.

\(D=Deaths\)

\(I=Infected\)

\(R=MortalityRate\)

If we take the derivative with Mortality Rate as a variable we get the following:

  • \(\frac{dD}{dt}=\frac{dI}{dt}*\frac{dD}{dI}+\frac{dR}{dt}*\frac{dD}{dR}\)

  • \(\frac{dD}{dt}=\frac{dI}{dt}*R+\frac{dR}{dt}*I\)

In plain terms, the change in Deaths is equal to the change in infections times the Mortality Rate, plus the change in Mortality Rate times the infections.

There is one major problem with this: we cannot make an accurate estimate of the Mortality Rate, which means we also cannot accurately estimate its change over time. So, while this would be the better option, I do not believe it is feasible with the data we have now.

However, taking the derivative with Mortality Rate as a constant gives us the following:

  • \(\frac{dD}{dt}=\frac{dI}{dt}*R\)

which also tells us:

  • \(\frac{dI}{dt}=\frac{\frac{dD}{dt}}{R}\)

In plain terms, we get that the change in Deaths is equal to the change in the infections, times the Mortality Rate. We also know that the change in infections is equal to the change in Deaths, divided by the Mortality Rate.

This means that with a constant Mortality Rate, the change in Deaths will mirror the change in Infections, as the rate is just scaling the numbers. This tells us we can make predictions on the change in Infections by using the change in Deaths, scaling by the Mortality Rate afterwards!

Using this information, I will build my model with Mortality Rate as a constant.
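A minimal sketch of that scaling, using the Italy deaths-per-day figures that appear later in this post; the 10% mortality rate is a placeholder assumption, not an estimate:

```r
# dI/dt = (dD/dt) / R, with R held constant
deaths_per_day <- c(427, 627, 793) # reported Deaths per day (Italy table)
R <- 0.10                          # assumed constant Mortality Rate (placeholder)
infections_per_day <- deaths_per_day / R
infections_per_day                 # 4270 6270 7930 implied new infections per day
```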

4.3 The math behind the Monte Carlo Simulation

\(f(t)=P_0*(1+r)^t\)

We are calculating this one day at a time, which means our t value will be 1, giving us this:

Writing \(r_0\) for the previous day's rate, \(r_1\) for the next day's rate, and \(\Delta r = r_1 - r_0\) for the change in the rate, with \(f(0)=P_0\):

  • \(f(1)=P_0(1+r_1)\)

  • \(r_1=\frac{f(1)}{P_0}-1\)

  • \(r_1=\frac{f(1)-f(0)}{f(0)}\)

  • \(r_1=r_0+\Delta r\)

  • \(f(1)=P_0(1+r_0+\Delta r)\)

  • \(f(1)=P_0(r_0+\Delta r)+P_0\)

In plain terms, the day to predict is equal to the previous day, plus the previous day times the new rate (the previous rate plus its change). This means that if we can predict the change in our rate, we can find our predicted Cumulative Deaths!
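One step of this update can be checked against the 2020-03-22 and 2020-03-23 rows of the Italy table shown later (here the "predicted" change in the rate is simply the observed one):

```r
P0      <- 5476        # cumulative Deaths on 2020-03-22
r0      <- 0.1349223   # that day's rate of change
delta_r <- -0.02517064 # change in the rate on 2020-03-23
r1 <- r0 + delta_r     # next day's rate: ~0.1097516
f1 <- P0 * r1 + P0     # predicted next-day cumulative Deaths
round(f1)              # 6077, matching the reported 03-23 value
```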

4.4 The Algorithm

To start, there are a few things to note about how I built this algorithm:

  • I will build it using the mean and standard deviation from the past 8 days.
    • What this means is, we will only look at the last 8 rates of change to base our next day on.
    • Once we predict a future rate, we throw away the oldest rate and re-model our predictions!
      • This makes sure that our model changes in accordance with our predictions; otherwise we would start to see a more linear trend in our predictions.
  • I have split this algorithm into multiple functions so I can stop at certain points to explain what's going on.

Let's dive in by taking a look at how our rates of change work!

5 Prediction Analysis and building

To start, since we know how much our Deaths change each day, we can work out what percentage that change is of our cumulative Deaths. To do this, we need the following formula:

\(r=\frac{\Delta f(t)}{f(t-1)}\)

In plain terms, this formula takes the change per day in our Deaths and divides it by the previous day's cumulative Deaths, turning the Deaths per day into a percentage of the cumulative Deaths!

##          Date Deaths Deaths_PD Deaths_RC
## 27 2020-03-19   3405       427 0.1433848
## 28 2020-03-20   4032       627 0.1841410
## 29 2020-03-21   4825       793 0.1966766
## 30 2020-03-22   5476       651 0.1349223
## 31 2020-03-23   6077       601 0.1097516

Now that this step is complete, we can calculate the change in our Death rate of change, essentially looking at how much our rates are fluctuating over time. We can do this simply by subtracting the previous day's rate of change from the current day's rate of change. This means that if the rate goes up, the change will be positive!

##          Date Deaths Deaths_PD Deaths_RC      D_RC_C
## 27 2020-03-19   3405       427 0.1433848 -0.04638745
## 28 2020-03-20   4032       627 0.1841410  0.04075615
## 29 2020-03-21   4825       793 0.1966766  0.01253562
## 30 2020-03-22   5476       651 0.1349223 -0.06175431
## 31 2020-03-23   6077       601 0.1097516 -0.02517064
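Both derived columns can be reproduced from the cumulative Deaths alone; the 2978 starting value is implied by the table (3405 - 427):

```r
# Cumulative Deaths, 2020-03-18 through 2020-03-23 (03-18 inferred as 3405 - 427)
deaths <- c(2978, 3405, 4032, 4825, 5476, 6077)

pd     <- diff(deaths)          # Deaths_PD: 427 627 793 651 601
rc     <- pd / head(deaths, -1) # Deaths_RC: 0.1434 0.1841 0.1967 0.1349 0.1098
d_rc_c <- diff(rc)              # D_RC_C:    0.0408 0.0125 -0.0618 -0.0252
```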

Great: here are Italy's cumulative Deaths, Deaths per day, Deaths rate of change, and the change in the rate of change. We will use all this information to work backwards at the end to fill in the predictions.

This is the distribution of the change in the rate of change. We want to make sure it is relatively normal; otherwise a different method would be needed to make predictions.

## 
##  Shapiro-Wilk normality test
## 
## data:  italy$D_RC_C
## W = 0.89634, p-value = 0.2677

Exactly what we wanted to see: our p-value is > 0.05, meaning we fail to reject normality, so our distribution is normal enough to draw probabilities from!

5.1 Building the predictions

Now that we have the changes per day, we can build the prediction model. I will break this down into steps:

  • Use a changing mean and standard deviation:

    • Think of this as a revolving window: we only want to look at the past 8 days' changes.
  • Generate a random number for each day, per trial

  • Calculate backwards to get the Cumulative Deaths

### functions used inside the prediction model below
get_rc <- function(death_rc_n1, change){
  # next rate of change = lagged rate of change + predicted change in the rate
  death_rc_n1 + change
}

get_pd <- function(deaths_n1, deaths_rc){
  # deaths per day = lagged cumulative deaths * rate of change
  # (a count cannot be negative, so take the absolute value and round up)
  ceiling(abs(deaths_rc * deaths_n1))
}

get_death <- function(lag_death, dpd){
  # next cumulative deaths = lagged cumulative deaths + deaths per day
  ceiling(lag_death + abs(dpd))
}

### function used for prediction model
prediction_model <- function(data, days, trials, split, curr_d){
  for (n in 1:trials){
    temp <- data %>% 
      subset(select = c(Date, D_RC_C))

    for (i in 1:days){
      # revolving window: the last `split` changes, including prior predictions
      window <- tail(temp$D_RC_C, split)
      mu   <- mean(window) # mean of the window
      sdev <- sd(window)   # standard deviation of the window
      
      set.seed(i*13*n)
      temp <- temp %>% 
        bind_rows(data.frame(
          Date   = as.Date(curr_d, format = "%Y-%m-%d") + i,
          D_RC_C = rnorm(1, mu, sdev) # draw the next day's change in the rate
        )) %>% 
        mutate(
          Trial = n
        )
    }
    
    if (n == 1){
      trial_data <- temp
    }
    else{
      trial_data <- temp %>% 
        bind_rows(trial_data)
    }
  }
  
  return(trial_data)
}
curr_d <- max(italy[italy$D_RC_C != 0,"Date"])

TRIALS <- 20
DAYS <- 10
SPLIT <- 8

italy <- italy %>% 
  prediction_model(days = DAYS, trials = TRIALS, split = SPLIT, curr_d) %>% 
  group_by(Trial, Date) %>% 
  slice(n()) %>% 
  ungroup() %>% 
  mutate(
    Deaths = NA,
    Deaths_PD = NA,
    Deaths_RC = NA,
    Country = "Italy"
  )

6 Taking a look at our Simulation Results!

As shown, these are the 20 simulations for a 10-day forecast. Let's take a deeper look.

Here we can see the clustering of our simulations. The farther out the forecast, the less clustered it becomes. This means that the variance grows as we predict further into the future, something that is expected since this is a Markov technique.

Let's take a look at the distribution of these individual days!

This shows exactly what is going on with each day the algorithm predicts. As the number of predicted days increases, the distribution of our Cumulative Deaths flattens out. This means there is more variation, or 'less accuracy', with each day the algorithm predicts. This tells us it will be more advantageous to use this algorithm as a short-term predictor rather than a long-term one.

Now let's take a look at our individual rates!

Now we can see a clear split between rates trending downwards and rates climbing upwards! This is great news, as it generally implies the Rate of Change will start to come down, suggesting that what has happened in the past 8 days is directly affecting the rate at which Covid-19 is spreading!

6.1 Running different hyperparameters

One thing that is possible with this algorithm is changing the hyperparameters. I am going to import a dataset of this algorithm run with more trials and different hyperparameters. I specifically want to limit the range to a 6-day forecast.

Here we can see that when we use a revolving mean and standard deviation, the results can vary vastly depending on the chosen window. In the above chart, the variation of the data seems to explode when any window over 10 is chosen. This is most likely because the rates were vastly different in the beginning stages of the spread, when the overall numbers were lower. This suggests we should use a window of less than 10 for better results.

There is one thing I want to point out: when observing the 30-day window, the median Rate of Change is the lowest of all the options for every day except the first. This suggests that although the variance between trials is higher, the median is still converging to a lower number.

Let's take a deeper look at the simulations with windows of less than 10.

It seems that nearly every simulation showed a negative median change. This strongly indicates that the change in rates is slowing down, meaning the rate of Deaths is slowing down overall.

6.2 Comparing the Simulation to Real Life

Let's take a look at the mean Deaths across the trials and compare that with the actual data we now have for Italy!

This chart shows the actual cumulative Deaths for March 24-29, as well as the predicted values. This tells us a few things about the simulations:

  • The hyperparameter closest to matching the actual data is the window of 9 days.
    • Simulating with an error within 200 Deaths per day!
  • Our predictions seem to get worse over time overall.
    • Again, this suggests that our model should only be used for short-term predictions.

7 Conclusion